MMLU Redux #883
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for adding this, @clefourrier! LGTM, with one question about whether we can simplify usage by providing a subset that evaluates all categories at once. Don't worry if this isn't possible.
Regarding the comparison to Qwen3, it's not clear from their paper whether they used 0-shot for the post-trained models. Could you share the score one gets in that case?
    "world_religions",
]

_mmlu_redux_2_tasks = {
Is it possible to provide an `all` (or similar) subset, so that one can run `lighteval|mmlu_redux_2:all|5|0` instead of having to create the long list of categories?
It should happen automatically if you don't select the subsets
`lighteval|mmlu_redux_2|...` should do the trick on its own :)
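As a sketch of why the bare family name works, here is a hypothetical illustration of how a lighteval-style task registry can expand a spec with no `:subset` suffix into one entry per registered category (identifiers like `expand_task_spec` are illustrative, not the actual lighteval internals):

```python
# Hypothetical sketch: expanding a task spec without an explicit ":subset"
# into every registered subset. The registry below is a small stand-in for
# the real _mmlu_redux_2_tasks dict; names are illustrative only.

_mmlu_redux_2_tasks = {
    "anatomy": None,
    "astronomy": None,
    "world_religions": None,
}

def expand_task_spec(spec: str) -> list[str]:
    """Expand 'suite|task|fewshot|trunc' to one spec per subset when the
    task name carries no explicit ':subset'; otherwise keep it as-is."""
    suite, task, fewshot, trunc = spec.split("|")
    if ":" in task:
        return [spec]  # an explicit subset was requested
    return [
        f"{suite}|{task}:{subset}|{fewshot}|{trunc}"
        for subset in _mmlu_redux_2_tasks
    ]
```

Under this sketch, `expand_task_spec("lighteval|mmlu_redux_2|5|0")` yields one spec per category, which is what makes the long comma-separated list unnecessary.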
Yep, can do for one shot; let me wrap up my current PR and come back to you.
Above score was for the base model. For instruct, I'm getting 0.2277 for both zero-shot and 5-shot (against the 44.6 reported in non-thinking mode), which is a bit odd. Will deep-dive into the logs on Monday.
Probably good to also pass
Getting the following (on MMLU-Redux-2, vs. MMLU-Redux reported in the Qwen3 report):

vllm "model_name=Qwen/Qwen3-0.6B-Base,data_parallel_size=2,max_num_batched_tokens=100000,generation_parameters={temperature:0.6,top_p:0.95,top_k:20,min_p:0,presence_penalty:1,max_new_tokens:38912}" "lighteval|mmlu_redux_2|5|0"

(same as above with "model_name=google/gemma-3-1b-pt,data_parallel_size=2,max_num_batched_tokens=2001,generation_parameters={temperature:0.6,top_p:0.95,top_k:20,min_p:0,presence_penalty:1,max_new_tokens:5},gpu_memory_utilization=0.8")

It's impossible to know if the implementation is exactly the same, so IMO we're within range. For example, it's unclear whether the report uses a logprob or a generative approach; they use 5 shots which are likely not the same as ours, and we know that alone can lead to up to a 3-point difference, etc.
Not adding tests for now, as I think we'll want to run all evals at the same time, with a setup that's faster than our current one.